Improving cross-domain n-gram language modelling with skipgrams
نویسندگان
چکیده
In this paper we improve over the hierarchical Pitman-Yor processes language model in a cross-domain setting by adding skipgrams as features. We find that adding skipgram features reduces the perplexity. This reduction is substantial when models are trained on a generic corpus and tested on domain-specific corpora. We also find that within-domain testing and crossdomain testing require different backoff strategies. We observe a 30-40% reduction in perplexity in a cross-domain language modelling task, and up to 6% reduction in a within-domain experiment, for both English and Flemish-Dutch.
منابع مشابه
A Supervised Approach for Sentiment Analysis using Skipgrams
We present a supervised hybrid approach for Sentiment Analysis in Twitter. A sentiment lexicon is built from a dataset, where each tweet is labelled with its overall polarity. In this work, skipgrams are used as information units (in addition to words and n-grams) to enrich the sentiment lexicon with combinations of words that are not adjacent in the text. This lexicon is employed in conjunctio...
متن کاملFinding Cross-Lingual Spelling Variants
Finding term translations as cross-lingual spelling variants on the fly is an important problem for cross-lingual information retrieval (CLIR). CLIR is typically approached by automatically translating a query into the target language. For an overview of cross-lingual information retrieval, see [1]. When automatically translating the query, specialized terminology is often missing from the tran...
متن کاملGPLSI: Supervised Sentiment Analysis in Twitter using Skipgrams
In this paper we describe the system submitted for the SemEval 2014 Task 9 (Sentiment Analysis in Twitter) Subtask B. Our contribution consists of a supervised approach using machine learning techniques, which uses the terms in the dataset as features. In this work we do not employ any external knowledge and resources. The novelty of our approach lies in the use of words, ngrams and skipgrams (...
متن کاملCross-Domain Dialogue Act Tagging
We present recent work in the area of Cross-Domain Dialogue Act (DA) tagging. We have previously reported on the use of a simple dialogue act classifier based on purely intra-utterance features — principally involving word n-gram cue phrases automatically generated from a training corpus. Such a classifier performs surprisingly well, rivalling scores obtained using far more sophisticated langua...
متن کاملLanguage modelling with hierarchical domains
A new method of using domains to improve a variable ngram language model is presented. Previous research on the use of domains using a 2-tier approach is extended through the addition of further hierarchical levels of domain abstraction. Our method is based upon the concept of domain association where semantically related domains exhibit similar n-gram distributions. Association allows us to cl...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2016